Abstract

Unlike case-control studies, family-based tests for association are protected against population stratification. 
Complex genetic traits are often governed by quantitative precursors and it has been argued that it may be a 
more powerful strategy to analyze these quantitative precursors instead of the clinical end point trait. Although 
methods have been developed for family-based association tests for single quantitative traits, it is of interest to 
develop such methods for multivariate phenotypes. We propose a novel transmission-based approach based on a 
trio design using a simple logistic regression to test for association with a multivariate phenotype. We use our 
proposed method to analyze data on systolic and diastolic blood pressure levels provided in Genetic Analysis 
Workshop 18. However, we find that the bivariate analysis of the two phenotypes did not provide more promising 
results compared to univariate analyses, suggesting a possibility of a different set of major genetic variants 
modulating the two phenotypes. 



Background 

The family-based design [1] for detecting association is a 
popular alternative to population-based case-control stu- 
dies since it circumvents the problem of population 
stratification. Moreover, in spite of the successful identification
of a large number of common variants associated
with various complex traits, the proportion of total variation
in a trait explained by these variants has been minimal
and has motivated a search for rare variants that
could explain the "missing heritability". Because rare 
variants are likely to be more frequent in large families 
compared to the general population, it may be a more 
prudent strategy to test for transmission disequilibrium 
in pedigrees to identify these variants. Although trans- 
mission-based tests for association of both binary and 
quantitative traits have been extensively studied [1-4], 
extension of such tests for multivariate phenotypes is of 
current research interest. We have developed a compu- 
tationally simple logistic regression-based test that 
models the probability of transmission of the minor
allele at a single-nucleotide polymorphism (SNP) from a 
heterozygous parent conditioned on the multivariate 
phenotype values of the offspring. We apply our pro- 
posed method to analyze systolic and diastolic blood 
pressure levels in a pedigree using longitudinal data 
over four time points provided in Genetic Analysis 
Workshop 18 (GAW18). 

Data description 

For our analyses, we use pedigree data on systolic blood 
pressure (SBP) levels and diastolic blood pressure (DBP) 
levels at four different time points for 453 individuals 
along with their genotypes at all of the available 456,752 
variant sites distributed over 11 autosomal chromosomes. 
In addition to age, we used smoking status and medication 
indicator (both defined as binary variables) at each time 
point of examination as covariates, as these factors could 
be potential confounders in the association analyses. Both 
the SBP and the DBP levels were adjusted for these covariates
at each time point, and the tests for transmission disequilibrium
were performed on the adjusted phenotypes.

Methods

Statistical methodology 

Imputation of missing phenotype values and covariate 
adjustment 

Data on the two phenotypes and the covariates are not 
available for all individuals at every time point. The 
assumption of multivariate normality provides a computa- 
tionally elegant framework for the expectation maximiza- 
tion (EM) algorithm [5] to estimate parameters when data 
are missing. Blood pressure levels have traditionally been 
believed to follow a lognormal distribution. Although the
Kolmogorov-Smirnov test did not show any significant
departure from normality for the SBP and DBP levels at
any of the time points, some of the p-values are very close
to the threshold of 0.05. We thus perform a logarithmic 
transformation on each of the phenotypes to induce nor- 
mality. We use an unrelated set of 142 individuals from 
the pedigrees for whom data on all the variables are avail- 
able to estimate the missing log-transformed phenotype 
values using data on the available phenotype values. Suppose
the vector of log-transformed values of either of the
two phenotypes at the four time points is represented as
$X = (X_1, X_2, X_3, X_4)$. If $Y$ denotes the vector comprising
those components of $X$ that are missing and $Z$ the components
that are available for an individual, $Y$ is estimated
via an EM algorithm as the expectation of $Y$ conditioned
on $Z$, given by $\mu_Y + \Sigma_{YZ} \Sigma_{ZZ}^{-1} (Z - \mu_Z)$, where $\mu_Y$ and
$\mu_Z$ are the mean vectors of $Y$ and $Z$, respectively; $\Sigma_{YZ}$ is
the matrix of covariances between $Y$ and $Z$; and $\Sigma_{ZZ}$ is
the dispersion matrix of $Z$. We perform a linear regression
of the log-transformed values of each of the two pheno- 
types (available as well as imputed) at each time point on 
age, smoking status, and medication indicator. We plug in
the parameter estimates of the mean vector and variance- 
covariance matrix of the log-transformed phenotypes 
obtained via the EM algorithm to estimate the missing 
log-transformed values of each phenotype conditioned on 
the available log-transformed values of that phenotype at 
every time point for the remaining individuals in the pedi- 
gree. We then use the regression equation at each time 
point to obtain the residuals for all individuals in the pedi- 
gree for whom data are available on all the covariates. 
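
The conditional-expectation step can be illustrated with a minimal sketch. It assumes that the mean vector and variance-covariance matrix of the log-transformed phenotype have already been estimated (the EM estimation itself is not shown), and the function names `solve` and `impute_conditional_mean` as well as the numerical values are hypothetical, chosen only for illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: bring the largest entry of this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def impute_conditional_mean(x, mu, sigma):
    """Replace None entries of x by mu_Y + S_YZ S_ZZ^{-1} (Z - mu_Z)."""
    miss = [i for i, v in enumerate(x) if v is None]
    obs = [i for i, v in enumerate(x) if v is not None]
    if not miss:
        return list(x)
    if not obs:
        return list(mu)  # nothing observed: fall back to the mean vector
    s_zz = [[sigma[i][j] for j in obs] for i in obs]
    resid = [x[i] - mu[i] for i in obs]
    weights = solve(s_zz, resid)  # S_ZZ^{-1} (Z - mu_Z)
    filled = list(x)
    for i in miss:
        filled[i] = mu[i] + sum(sigma[i][j] * wj for j, wj in zip(obs, weights))
    return filled


# Hypothetical log-scale means and covariance for four time points.
mu = [4.80, 4.81, 4.82, 4.83]
sigma = [[0.020 if i == j else 0.012 for j in range(4)] for i in range(4)]
x = [4.90, None, 4.86, None]  # time points 2 and 4 are missing
filled = impute_conditional_mean(x, mu, sigma)
```

Observed values are left unchanged; only the missing components are replaced by their conditional expectations given the observed ones.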

Test for transmission disequilibrium using logistic regression

The phenotypes for our association analyses are the 
adjusted SBP and DBP levels at each time point 
obtained using the algorithm described in the preceding 
section. We use a novel binary logistic regression frame- 
work to test for association of a SNP with a multivariate 
phenotype. For each SNP, we consider all trios in the 
pedigree with at least 1 heterozygous parent at that 
SNP, selecting one sib at random from each sibship. 
Suppose $X = (X_1, X_2, X_3, \ldots, X_k)$ denotes a vector of $k$
phenotypes and $W$ is an indicator random variable (1 or
0) denoting whether a heterozygous parent at a SNP
transmits the minor allele or not. We model the conditional
distribution of $W$ given $X$ using a logistic link
function given by:

$$P(W = 1 \mid X_1, X_2, \ldots, X_k) = \frac{\exp\{\beta_0 + \sum_{i=1}^{k} \beta_i (X_i - \mu_i)\}}{1 + \exp\{\beta_0 + \sum_{i=1}^{k} \beta_i (X_i - \mu_i)\}}$$

where $\mu_i$ is the mean of $X_i$ in the population, which is
estimated by the sample mean, and the parameters
$\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ are estimated using the method of maximum
likelihood.
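
As an illustration only, the following sketch fits a mean-centered logistic model of this form by Newton-Raphson and compares it with the intercept-only model through a likelihood ratio statistic. It is not the authors' implementation: the function names, the simulated data, and the closed-form chi-square tail (valid only for an even number of degrees of freedom, such as 2 or 4) are assumptions made for this example.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def expit(t):
    """Numerically stable logistic function."""
    return 1.0 / (1.0 + math.exp(-t)) if t >= 0 else math.exp(t) / (1.0 + math.exp(t))


def log_likelihood(beta, design, w):
    total = 0.0
    for row, wi in zip(design, w):
        eta = sum(b * d for b, d in zip(beta, row))
        # log(1 + exp(eta)) computed without overflow
        total += wi * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))
    return total


def fit_logistic(phenotypes, w, n_iter=50, tol=1e-10):
    """Newton-Raphson ML fit with phenotypes centered at their sample means."""
    n, k = len(phenotypes), len(phenotypes[0])
    means = [sum(row[j] for row in phenotypes) / n for j in range(k)]
    design = [[1.0] + [row[j] - means[j] for j in range(k)] for row in phenotypes]
    beta = [0.0] * (k + 1)
    for _ in range(n_iter):
        p = [expit(sum(b * d for b, d in zip(beta, row))) for row in design]
        grad = [sum((wi - pi) * row[j] for wi, pi, row in zip(w, p, design))
                for j in range(k + 1)]
        hess = [[sum(pi * (1.0 - pi) * row[a] * row[c] for pi, row in zip(p, design))
                 for c in range(k + 1)] for a in range(k + 1)]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta, log_likelihood(beta, design, w)


def chi2_sf_even(x, df):
    """Upper-tail probability of a chi-square variable with even df."""
    if df % 2 != 0:
        raise ValueError("closed form used here requires even df")
    half, term, total = x / 2.0, 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


def transmission_lrt(phenotypes, w):
    """Likelihood ratio test of beta_1 = ... = beta_k = 0."""
    k = len(phenotypes[0])
    beta, ll_full = fit_logistic(phenotypes, w)
    p0 = sum(w) / len(w)  # ML estimate under the intercept-only model
    ll_null = sum(w) * math.log(p0) + (len(w) - sum(w)) * math.log(1.0 - p0)
    lrt = 2.0 * (ll_full - ll_null)
    return beta, lrt, chi2_sf_even(max(lrt, 0.0), k)


# Simulated data: 300 transmissions, two phenotypes, only the first has an effect.
rng = random.Random(1)
phenotypes = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(300)]
w = [1 if rng.random() < expit(1.5 * x1) else 0 for x1, _ in phenotypes]
beta, lrt, p_value = transmission_lrt(phenotypes, w)
```

The sketch makes no provision for perfectly separated data or for SNPs at which all transmissions are identical; real analyses would need to handle both.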

We note that even though this model is along similar lines
to that of Waldman [6], it captures the pattern of transmission
disequilibrium in a more optimal fashion as the pheno- 
types are corrected for their means, making this model 
more powerful. The test for transmission disequilibrium is 
equivalent to testing $H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$ versus $H_1$: not
$H_0$, and the log-likelihood ratio test statistic is distributed
as chi-square with $k$ degrees of freedom under the null
hypothesis. We compare the relative performances of 
three phenotype vectors in detecting association: (a) $T_1$: the
adjusted SBP levels summarized by the first two principal
components across the four time points; (b) $T_2$: the
adjusted DBP levels summarized by the first two principal
components across the four time points; and (c) $T_3$: a
bivariate phenotype comprising the adjusted SBP and the
adjusted DBP levels summarized by the first two principal 
components corresponding to each of the phenotypes 
across the four time points. The above choice of principal 
components is motivated by the fact that 75% of the varia- 
tion in each of the two phenotypes is explained by the cor- 
responding first two principal components. To correct for 
multiple testing, we use the false discovery rate procedure 
[7] with an overall rate of 0.05. 
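
The multiple-testing step can be sketched as follows, assuming that the procedure cited as [7] is the Benjamini-Hochberg step-up procedure (the reference itself is not reproduced in this section). The function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvals, q=0.05):
    """Return booleans marking the hypotheses rejected at FDR level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0
    # Step-up rule: find the largest rank r with p_(r) <= q * r / m.
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            cutoff = rank
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected


pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
significant = benjamini_hochberg(pvals, q=0.05)
```

With these ten illustrative p-values, only the two smallest fall below their rank-specific thresholds (0.005 and 0.010), so only those two hypotheses are rejected.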

Results 

The pedigree is made up of 95 distinct pairs of parents. 
Thus, our transmission disequilibrium analyses are 
based on 95 independent trios. Given that most parents 
have multiple offspring, there exists a large number of 
possible sets of trios if 1 sib is selected at random from 
each sibship made up of two or more sibs. We consider 
1000 such possible sets of trios at random. Because 
transmissions only from heterozygous parents are relevant
for the proposed test for transmission disequilibrium,
we analyze only those SNPs that have
at least 25 informative trios for efficient estimation of
parameters in the logistic regression. We also exclude
those SNPs that show significant deviation from
Hardy-Weinberg equilibrium based on the unrelated set
of 139 individuals for whom genotype data are available,
and use Bonferroni correction for multiple testing.
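
The resampling of trios and the informativeness filter can be sketched as below. The data layout (a dictionary from parent pairs to lists of sibs, and genotypes coded as minor-allele counts 0, 1, or 2) and all identifiers are assumptions made for illustration, not the format of the GAW18 files.

```python
import random


def sample_trio_sets(sibships, n_sets=1000, seed=0):
    """Draw n_sets sets of trios, choosing one sib at random per sibship."""
    rng = random.Random(seed)
    return [
        [(father, mother, rng.choice(sibs))
         for (father, mother), sibs in sibships.items()]
        for _ in range(n_sets)
    ]


def count_informative(trios, genotype):
    """Count trios with at least one heterozygous parent at a SNP."""
    return sum(1 for father, mother, _ in trios
               if genotype[father] == 1 or genotype[mother] == 1)


# Toy pedigree: two parent pairs, one with three sibs and one with a single sib.
sibships = {("F1", "M1"): ["C1", "C2", "C3"], ("F2", "M2"): ["C4"]}
genotype = {"F1": 1, "M1": 0, "F2": 0, "M2": 2,
            "C1": 0, "C2": 1, "C3": 1, "C4": 1}

trio_sets = sample_trio_sets(sibships, n_sets=5)
n_informative = count_informative(trio_sets[0], genotype)
```

In an analysis along the lines described above, a SNP would be retained only if this count reached the chosen threshold (25 in the text).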

The tests for association based on the proposed logistic 
regression are carried out on 426,193 SNPs. Among the 
phenotype vectors considered, contrary to our expectation
that $T_3$ (the phenotype made up of the first two principal
components of both SBP and DBP levels) would be more
powerful in detecting association, $T_1$ (the phenotype made
up of the first two principal components of SBP levels) provides
the most promising association finding. The SNPs
rs4754220 and rs12419678 on chromosome 11 attain
genome-wide significance (based on the desired false discovery
rate of 0.05) with $T_1$ in 37 and 35 of the 1000 sets
of trios, respectively. On the other hand, the SNP
rs13301156 on chromosome 9 exhibits significant evidence
of transmission disequilibrium with $T_2$ (first two principal
components of DBP) in 24 sets of trios. These three SNPs
also rank among the top five SNPs significantly associated
with $T_3$, although in fewer than 10 sets of trios.

Conclusions 

We have developed a simple binary logistic regression 
model that incorporates multiple phenotypes for transmis- 
sion-based association analyses of the multivariate pheno- 
type vector. The method does not involve any modeling of 
the correlation structure within the components of the 
multivariate phenotype as required in likelihood-based 
approaches and, consequently, is more robust with respect 
to distributional assumptions. On the other hand, the 
method does not reduce the multivariate phenotype vector 
to principal components, thus circumventing the problem 
of biological interpretations of derived phenotypes. 

The SNPs rs4754220 and rs12419678 that exhibited
the most significant evidence of linkage disequilibrium
with SBP values are located in the intronic region of the
gene CWF19L2 (CWF19-like 2, cell-cycle control) on
11q22.3. Studies show that RNA expression of this gene
is upregulated in humans for inflammatory cardiomyopathy
[8]. On the other hand, the SNP rs13301156 that
yields significant evidence of association with DBP levels
is located in the intergenic region between the genes
RPS6P13 (ribosomal protein S6 pseudogene 13) and
GAS1 (growth arrest-specific 1) on 9q21.3. The RNA
expression of RPS6P13 has been reported to be downregulated
in humans for coronary collateralization [9],
while the RNA expression of GAS1 has been reported to
be upregulated for arrhythmogenic right ventricular cardiomyopathy
in humans [10].

It is expected that if a genetic variant modulates multiple 
phenotypes, a multivariate analysis will be more powerful 
than separate univariate analyses in detecting association 
with the genetic variant. However, we find that the associa- 
tion test for the bivariate phenotype is less powerful than 
the tests for SBP levels and DBP levels separately. More- 
over, the most significant association findings obtained for 
the bivariate phenotype form a disjoint union of those 
obtained for the two phenotypes separately. Consequently, 
it is possible that although there may be common genes
modulating both SBP and DBP levels, the major genetic
variants for the two phenotypes may be different and 
the bivariate phenotype contains minimal additional 
information on the variants compared to any of the two 
phenotypes. 

The proposed transmission-based association test can 
incorporate multiple sibs within a sibship by considering 
the transmission to each sib separately. However, such a 
test is strictly a valid test only for linkage. Although the 
presence of association increases the power to detect 
transmission disequilibrium, the rejection of the null 
hypothesis does not necessarily imply the presence of link- 
age disequilibrium. When we perform our proposed test 
with all sibs within each sibship, we obtain large clusters of 
significant SNPs since linkage exists over much larger dis- 
tances on the genome compared to linkage disequilibrium. 
However, we find that the clusters on chromosomes 9 and 
11 include the three SNPs that provided the most signifi- 
cant evidence of association. We are currently exploring 
the theoretical properties of various methods to integrate 
the test statistics (such as the mean or the maximum order 
statistic) for the different sets of trios (considering 1 sib at
random from each sibship) into a combined test statistic. 